import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from plotnine import ggplot, aes, geom_point, labs

JSC 370: Data Science II
Week 3: EDA and Data Visualization
Goals
- Load in large datasets stored locally and remotely (GitHub), check memory issues
- Understand what EDA is (and is not)
- Build a repeatable EDA workflow
- Make plots that reveal structure, not just decoration
- Avoid common visualization pitfalls
Data Science Pipeline
- Today we focus on the stages of the pipeline highlighted in blue

What is Exploratory Data Analysis?
- EDA = iterative exploration and building modeling intuition
- Focus on data distributions, missingness, relationships, anomalies
- Create: summary stats, exploratory plots
- Output: questions, hypotheses, and next steps
Exploratory Data Analysis
EDA is the process of summarizing data
It should be the first step in your analysis pipeline, and it serves to:
- ✅ check data (import issues, outliers, missing values, data errors)
- 🧹 clean data (fix implausible values, remove missing if necessary, rename, etc.)
- \(\Sigma\) compute summary statistics of key variables (univariate and bivariate)
- 📊 make basic plots and graphs
EDA Checklist
EDA Checklist:
- Formulate a question
- Read in the data
- Check the dimensions and headers and footers of the data
- Check the variable types in the data
- Take a closer look at some/all of the variables
- Validate with an external source
- Conduct some summary statistics to answer the initial question
- Make exploratory graphs
Case Study
We are going to use a dataset created from NOAA's National Centers for Environmental Information (NCEI). The data are hourly measurements from weather stations across the continental U.S.

Formulate a Question
It is a good idea to first have a question such as:
- what weather stations reported the hottest and coldest daily temperatures?
- what day of the month was on average the hottest?
- is there covariation between temperature and humidity in my dataset?
Read in the data
There are several ways to read in data (some depend on the type of data you have):
- pandas.read_csv() for delimited files
- pandas.read_parquet() for parquet
- pandas.read_feather() for feather
- pandas.read_excel() for .xls and .xlsx
- pyarrow.dataset / dask for larger-than-memory workflows
Read in the data
We’ll use Python packages pandas, numpy, matplotlib, seaborn, plotnine
Example: read in local file
If your data are a gzipped delimited file (e.g., met_all.gz), pandas can read it directly.
met = pd.read_csv("met_all.gz", compression="gzip")
met.head(4)

| | USAFID | WBAN | year | month | day | hour | min | lat | lon | elev | ... | vis.dist.qc | vis.var | vis.var.qc | temp | temp.qc | dew.point | dew.point.qc | atm.press | atm.press.qc | rh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 690150 | 93121 | 2019 | 8 | 1 | 0 | 56 | 34.3 | -116.166 | 696 | ... | 5 | N | 5 | 37.2 | 5 | 10.6 | 5 | 1009.9 | 5 | 19.881266 |
| 1 | 690150 | 93121 | 2019 | 8 | 1 | 1 | 56 | 34.3 | -116.166 | 696 | ... | 5 | N | 5 | 35.6 | 5 | 10.6 | 5 | 1010.3 | 5 | 21.760980 |
| 2 | 690150 | 93121 | 2019 | 8 | 1 | 2 | 56 | 34.3 | -116.166 | 696 | ... | 5 | N | 5 | 34.4 | 5 | 7.2 | 5 | 1010.6 | 5 | 18.482122 |
| 3 | 690150 | 93121 | 2019 | 8 | 1 | 3 | 56 | 34.3 | -116.166 | 696 | ... | 5 | N | 5 | 33.3 | 5 | 5.0 | 5 | 1011.6 | 5 | 16.888622 |
4 rows × 30 columns
Example: read in GitHub file
If your data are a gzipped delimited file stored on GitHub, read it from the raw file URL:
url = "https://raw.githubusercontent.com/JSC370/JSC370-2026/main/data/met_all.gz"
met = pd.read_csv(url, compression="gzip")
met.head(4)

| | USAFID | WBAN | year | month | day | hour | min | lat | lon | elev | ... | vis.dist.qc | vis.var | vis.var.qc | temp | temp.qc | dew.point | dew.point.qc | atm.press | atm.press.qc | rh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 690150 | 93121 | 2019 | 8 | 1 | 0 | 56 | 34.3 | -116.166 | 696 | ... | 5 | N | 5 | 37.2 | 5 | 10.6 | 5 | 1009.9 | 5 | 19.881266 |
| 1 | 690150 | 93121 | 2019 | 8 | 1 | 1 | 56 | 34.3 | -116.166 | 696 | ... | 5 | N | 5 | 35.6 | 5 | 10.6 | 5 | 1010.3 | 5 | 21.760980 |
| 2 | 690150 | 93121 | 2019 | 8 | 1 | 2 | 56 | 34.3 | -116.166 | 696 | ... | 5 | N | 5 | 34.4 | 5 | 7.2 | 5 | 1010.6 | 5 | 18.482122 |
| 3 | 690150 | 93121 | 2019 | 8 | 1 | 3 | 56 | 34.3 | -116.166 | 696 | ... | 5 | N | 5 | 33.3 | 5 | 5.0 | 5 | 1011.6 | 5 | 16.888622 |
4 rows × 30 columns
Check the types of variables in our dataset
We can get the types of data (integer, float, object/string) from met.info(). This also tells us the memory usage.
met.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2377343 entries, 0 to 2377342
Data columns (total 30 columns):
# Column Dtype
--- ------ -----
0 USAFID int64
1 WBAN int64
2 year int64
3 month int64
4 day int64
5 hour int64
6 min int64
7 lat float64
8 lon float64
9 elev int64
10 wind.dir float64
11 wind.dir.qc object
12 wind.type.code object
13 wind.sp float64
14 wind.sp.qc object
15 ceiling.ht float64
16 ceiling.ht.qc int64
17 ceiling.ht.method object
18 sky.cond object
19 vis.dist float64
20 vis.dist.qc object
21 vis.var object
22 vis.var.qc object
23 temp float64
24 temp.qc object
25 dew.point float64
26 dew.point.qc object
27 atm.press float64
28 atm.press.qc int64
29 rh float64
dtypes: float64(10), int64(10), object(10)
memory usage: 544.1+ MB
What counts as “very large” data on a laptop?
On laptops, “very large” usually means it stresses RAM and slows basic operations.
A dataset can be “very large” even with only a few million rows if it has:
- many columns
- lots of strings (object dtype)
- expensive operations (joins, groupbys, sorting)
Practical rule-of-thumb
Think in terms of memory, not just rows × columns.
- Comfortable: under ~1 GB in RAM
- Large: ~1–5 GB (needs care: dtypes, filtering early)
- Very large: > ~5 GB or doesn’t fit in RAM → you need a different approach/tool
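The rule-of-thumb above can be wrapped in a small helper; a minimal sketch (the function name and thresholds are mine, taken straight from the guideline):

```python
import numpy as np
import pandas as pd

def memory_category(df: pd.DataFrame) -> str:
    """Classify a DataFrame by in-RAM size using the rough thresholds above."""
    gb = df.memory_usage(deep=True).sum() / 1024**3
    if gb < 1:
        return "comfortable"
    if gb <= 5:
        return "large"
    return "very large"

# A 1,000-row frame of int64 is a few KB: comfortably small.
small = pd.DataFrame({"x": np.arange(1_000)})
print(memory_category(small))  # → comfortable
```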
Measure size in pandas
- Calculate the total memory used by the DataFrame (convert bytes -> GB)
met.memory_usage(deep=True).sum() / 1024**3

np.float64(1.401238577440381)
Measure size in pandas
- Find the biggest memory columns (this time in MB)
(met.memory_usage(deep=True).sort_values(ascending=False) / 1024**2).head(10)

dew.point.qc 113.360548
temp.qc 113.360548
vis.var 113.360548
wind.type.code 113.360548
sky.cond 113.360548
ceiling.ht.method 113.360548
vis.dist.qc 103.298048
vis.var.qc 102.860548
wind.sp.qc 99.797945
wind.dir.qc 85.994595
dtype: float64
Why strings make data “big”
- In pandas, object columns (strings) are often the memory hog.
- dtypes tells us what kind of variables we have in a dataset.
met.dtypes

USAFID int64
WBAN int64
year int64
month int64
day int64
hour int64
min int64
lat float64
lon float64
elev int64
wind.dir float64
wind.dir.qc object
wind.type.code object
wind.sp float64
wind.sp.qc object
ceiling.ht float64
ceiling.ht.qc int64
ceiling.ht.method object
sky.cond object
vis.dist float64
vis.dist.qc object
vis.var object
vis.var.qc object
temp float64
temp.qc object
dew.point float64
dew.point.qc object
atm.press float64
atm.press.qc int64
rh float64
dtype: object
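When an object column holds a small set of repeated codes (like the `.qc` flag columns here), converting it to pandas' `category` dtype usually cuts its memory dramatically: values are stored as small integer codes plus one lookup table. A minimal sketch on a made-up toy column:

```python
import pandas as pd

# Toy qc-flag column: one million short, highly repetitive strings.
df = pd.DataFrame({"temp.qc": ["5", "5", "9", "1", "5"] * 200_000})

before = df["temp.qc"].memory_usage(deep=True)
df["temp.qc"] = df["temp.qc"].astype("category")  # integer codes + one lookup table
after = df["temp.qc"].memory_usage(deep=True)

print(f"object: {before / 1024**2:.1f} MB -> category: {after / 1024**2:.1f} MB")
```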
Big data can be read in more cautiously
If the file is really big (ours is borderline), we could read only the columns we need. This is where having a question in mind is useful. For ours, we probably just want station location (lat, lon), year, month, day, hour, temperature and humidity.
url = "https://raw.githubusercontent.com/JSC370/JSC370-2026/main/data/met_all.gz"
usecols = ["USAFID","lat","lon", "year", "month","day","hour","temp", "rh"]
met_small = pd.read_csv(
url,
compression="gzip",
usecols=usecols
)
met_small.head(4)

| | USAFID | year | month | day | hour | lat | lon | temp | rh |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 690150 | 2019 | 8 | 1 | 0 | 34.3 | -116.166 | 37.2 | 19.881266 |
| 1 | 690150 | 2019 | 8 | 1 | 1 | 34.3 | -116.166 | 35.6 | 21.760980 |
| 2 | 690150 | 2019 | 8 | 1 | 2 | 34.3 | -116.166 | 34.4 | 18.482122 |
| 3 | 690150 | 2019 | 8 | 1 | 3 | 34.3 | -116.166 | 33.3 | 16.888622 |
Big data can be read in more cautiously
Chunking in pandas means reading in a file in pieces so you don’t load the whole dataset into RAM at once.
url = "https://raw.githubusercontent.com/JSC370/JSC370-2026/main/data/met_all.gz"
chunks = pd.read_csv(url, compression="gzip", chunksize=200_000)
n = 0
temp_sum = 0.0
for ch in chunks:
temp_sum += ch["temp"].sum(skipna=True)
n += ch["temp"].notna().sum()
temp_sum / n

np.float64(23.590542469664527)
This computes the overall mean temperature without ever storing the full dataset:
- temp_sum accumulates the sum of temps across chunks.
- n accumulates how many non-missing temps you saw.
- temp_sum / n is the mean.
We’re basically doing: \(\bar{x} = \frac{\sum x_i}{\#\{x_i \text{ not missing}\}}\)
Check the data
Once we have loaded the data we should check the dimensions of the dataset. In pandas, we use met.shape.
print("shape:", met.shape)
print("n_rows:", met.shape[0])
print("n_cols:", met.shape[1])

shape: (2377343, 30)
n_rows: 2377343
n_cols: 30
Check the data: Summary
We see that there are 2,377,343 records of hourly temperature in August 2019 from all of the weather stations in the US. The data set has 30 variables.
We know that we are supposed to have hourly measurements of weather data for the month of August 2019 for the entire US. We should check that we have all of these components. Let us check:
- the year, month, hours
- the range of locations (latitude and longitude)
- temperature
- relative humidity
Check the data more closely
We can get a quick summary from met.describe().
met.describe()

| | USAFID | WBAN | year | month | day | hour | min | lat | lon | elev | wind.dir | wind.sp | ceiling.ht | ceiling.ht.qc | vis.dist | temp | dew.point | atm.press | atm.press.qc | rh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2.377343e+06 | 2.377343e+06 | 2377343.0 | 2377343.0 | 2.377343e+06 | 2.377343e+06 | 2.377343e+06 | 2.377343e+06 | 2.377343e+06 | 2.377343e+06 | 1.592053e+06 | 2.297650e+06 | 2.256068e+06 | 2.377343e+06 | 2.296387e+06 | 2.317254e+06 | 2.311055e+06 | 711069.000000 | 2.377343e+06 | 2.310917e+06 |
| mean | 7.230995e+05 | 2.953885e+04 | 2019.0 | 8.0 | 1.599589e+01 | 1.134106e+01 | 3.956097e+01 | 3.794132e+01 | -9.214509e+01 | 4.158152e+02 | 1.850082e+02 | 2.459166e+00 | 1.616648e+04 | 5.027185e+00 | 1.492117e+04 | 2.359054e+01 | 1.702084e+01 | 1014.164390 | 7.728208e+00 | 7.164138e+01 |
| std | 2.156678e+03 | 3.336921e+04 | 0.0 | 0.0 | 8.917096e+00 | 6.860069e+00 | 1.724761e+01 | 5.171552e+00 | 1.195324e+01 | 5.737842e+02 | 9.230697e+01 | 2.147395e+00 | 9.282465e+03 | 1.249445e+00 | 3.879577e+03 | 6.059275e+00 | 6.206226e+00 | 4.059265 | 2.018170e+00 | 2.275361e+01 |
| min | 6.901500e+05 | 1.160000e+02 | 2019.0 | 8.0 | 1.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.455000e+01 | -1.242900e+02 | -1.300000e+01 | 3.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 0.000000e+00 | -4.000000e+01 | -3.720000e+01 | 960.500000 | 1.000000e+00 | 8.334298e-01 |
| 25% | 7.209280e+05 | 3.706000e+03 | 2019.0 | 8.0 | 8.000000e+00 | 5.000000e+00 | 2.300000e+01 | 3.396700e+01 | -9.801800e+01 | 1.010000e+02 | 1.200000e+02 | 0.000000e+00 | 3.048000e+03 | 5.000000e+00 | 1.609300e+04 | 1.960000e+01 | 1.380000e+01 | 1011.800000 | 5.000000e+00 | 5.579015e+01 |
| 50% | 7.227280e+05 | 1.386000e+04 | 2019.0 | 8.0 | 1.600000e+01 | 1.100000e+01 | 5.000000e+01 | 3.835000e+01 | -9.170800e+01 | 2.520000e+02 | 1.800000e+02 | 2.100000e+00 | 2.200000e+04 | 5.000000e+00 | 1.609300e+04 | 2.350000e+01 | 1.810000e+01 | 1014.100000 | 9.000000e+00 | 7.655374e+01 |
| 75% | 7.250900e+05 | 5.476700e+04 | 2019.0 | 8.0 | 2.400000e+01 | 1.700000e+01 | 5.500000e+01 | 4.194000e+01 | -8.298600e+01 | 4.000000e+02 | 2.600000e+02 | 3.600000e+00 | 2.200000e+04 | 5.000000e+00 | 1.609300e+04 | 2.780000e+01 | 2.170000e+01 | 1016.400000 | 9.000000e+00 | 9.062916e+01 |
| max | 7.268130e+05 | 9.499800e+04 | 2019.0 | 8.0 | 3.100000e+01 | 2.300000e+01 | 5.900000e+01 | 4.894100e+01 | -6.831300e+01 | 9.999000e+03 | 3.600000e+02 | 3.600000e+01 | 2.200000e+04 | 9.000000e+00 | 1.600000e+05 | 5.600000e+01 | 3.600000e+01 | 1059.900000 | 9.000000e+00 | 1.000000e+02 |
- Note that this doesn’t tell us if there are any missing data!
Check the data more closely
It is a little harder to quickly summarize missing data in pandas, but here’s a way to count missing observations and calculate the percent missing.
missing = met.isna().sum().to_frame("n_missing")
missing["pct_missing"] = (missing["n_missing"] / len(met)).round(4)
missing.sort_values("pct_missing", ascending=False).head(20)

| | n_missing | pct_missing |
|---|---|---|
| atm.press | 1666274 | 0.7009 |
| wind.dir | 785290 | 0.3303 |
| ceiling.ht | 121275 | 0.0510 |
| vis.dist | 80956 | 0.0341 |
| wind.sp | 79693 | 0.0335 |
| dew.point | 66288 | 0.0279 |
| rh | 66426 | 0.0279 |
| temp | 60089 | 0.0253 |
| lat | 0 | 0.0000 |
| day | 0 | 0.0000 |
| atm.press.qc | 0 | 0.0000 |
| year | 0 | 0.0000 |
| dew.point.qc | 0 | 0.0000 |
| month | 0 | 0.0000 |
| temp.qc | 0 | 0.0000 |
| vis.var.qc | 0 | 0.0000 |
| vis.var | 0 | 0.0000 |
| vis.dist.qc | 0 | 0.0000 |
| sky.cond | 0 | 0.0000 |
| lon | 0 | 0.0000 |
Check variables more closely
If we return to our initial question (what weather stations reported the hottest and coldest temperatures?), we should take a closer look at our key variable, temperature (temp).
met["temp"].describe()

count 2.317254e+06
mean 2.359054e+01
std 6.059275e+00
min -4.000000e+01
25% 1.960000e+01
50% 2.350000e+01
75% 2.780000e+01
max 5.600000e+01
Name: temp, dtype: float64
- It looks like the data are in Celsius.
Check variables more closely
- It’s odd to have -40C in August.
- Subset rows where temp == -40.0 and select a few columns with loc, which is an indexer for selecting rows and columns.
met_ss = met.loc[met["temp"] == -40.0, ["USAFID","hour", "lat", "lon", "elev", "temp", "wind.sp"]]
met_ss.shape
met_ss.head()

| | USAFID | hour | lat | lon | elev | temp | wind.sp |
|---|---|---|---|---|---|---|---|
| 496047 | 720717 | 0 | 29.117 | -89.55 | 36 | -40.0 | NaN |
| 496048 | 720717 | 0 | 29.117 | -89.55 | 36 | -40.0 | NaN |
| 496049 | 720717 | 0 | 29.117 | -89.55 | 36 | -40.0 | NaN |
| 496050 | 720717 | 1 | 29.117 | -89.55 | 36 | -40.0 | NaN |
| 496051 | 720717 | 1 | 29.117 | -89.55 | 36 | -40.0 | NaN |
- It appears that all of the -40C records come from the same site and have missing wind speed. This points to a data issue.
Validate against an external source
We should check outside sources to make sure that our data make sense. The observation with -40°C is suspicious, so we should look up the location of the weather station.
Go to Google Maps and enter the coordinates for the site with -40C: (29.12, -89.55).

- It doesn’t make much sense to have a -40C reading in the Gulf of Mexico off the coast of Louisiana.
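Part of this check can be automated: flag any station whose coordinates fall outside a rough continental-US bounding box (the box limits below are my own approximation). Note that this only catches gross location errors; the Gulf of Mexico point above sits inside the box, so map-based spot checks are still worthwhile.

```python
import pandas as pd

# Rough continental-US bounding box (approximate values, for illustration only).
LAT_MIN, LAT_MAX = 24.5, 49.5
LON_MIN, LON_MAX = -125.0, -66.5

def outside_conus(df: pd.DataFrame) -> pd.Series:
    """True where a (lat, lon) pair falls outside the bounding box."""
    inside = df["lat"].between(LAT_MIN, LAT_MAX) & df["lon"].between(LON_MIN, LON_MAX)
    return ~inside

# The suspicious Gulf station passes this coarse check; a lat-60 point does not.
pts = pd.DataFrame({"lat": [29.117, 60.0], "lon": [-89.55, -89.55]})
print(outside_conus(pts).tolist())  # → [False, True]
```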
Filter implausible values
Filter with loc and sort the data with sort_values in one line. Note that the comparison temp > -40 also drops rows where temp is missing, since NaN comparisons evaluate to False:
met = met.loc[met["temp"] > -40].sort_values("temp")
met.head(20)

| | USAFID | WBAN | year | month | day | hour | min | lat | lon | elev | ... | vis.dist.qc | vis.var | vis.var.qc | temp | temp.qc | dew.point | dew.point.qc | atm.press | atm.press.qc | rh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1203054 | 722817 | 3068 | 2019 | 8 | 1 | 1 | 56 | 38.767 | -104.300 | 1838 | ... | 9 | N | 5 | -17.2 | 5 | NaN | 9 | NaN | 9 | NaN |
| 1203127 | 722817 | 3068 | 2019 | 8 | 3 | 11 | 56 | 38.767 | -104.300 | 1838 | ... | 9 | N | 5 | -17.2 | 5 | NaN | 9 | NaN | 9 | NaN |
| 1203128 | 722817 | 3068 | 2019 | 8 | 3 | 12 | 56 | 38.767 | -104.300 | 1838 | ... | 9 | N | 5 | -17.2 | 5 | NaN | 9 | NaN | 9 | NaN |
| 1203221 | 722817 | 3068 | 2019 | 8 | 6 | 21 | 56 | 38.767 | -104.300 | 1838 | ... | 9 | N | 5 | -17.2 | 5 | NaN | 9 | NaN | 9 | NaN |
| 1203224 | 722817 | 3068 | 2019 | 8 | 6 | 22 | 56 | 38.767 | -104.300 | 1838 | ... | 9 | N | 5 | -17.2 | 5 | NaN | 9 | NaN | 9 | NaN |
| 1203052 | 722817 | 3068 | 2019 | 8 | 1 | 0 | 56 | 38.767 | -104.300 | 1838 | ... | 9 | N | 5 | -17.2 | 5 | NaN | 9 | NaN | 9 | NaN |
| 1203225 | 722817 | 3068 | 2019 | 8 | 6 | 23 | 3 | 38.767 | -104.300 | 1838 | ... | 9 | N | 5 | -17.0 | 6 | NaN | 9 | NaN | 9 | NaN |
| 1203220 | 722817 | 3068 | 2019 | 8 | 6 | 21 | 39 | 38.767 | -104.300 | 1838 | ... | 9 | N | 5 | -17.0 | 6 | NaN | 9 | NaN | 9 | NaN |
| 1203222 | 722817 | 3068 | 2019 | 8 | 6 | 22 | 32 | 38.767 | -104.300 | 1838 | ... | 9 | N | 5 | -17.0 | 6 | NaN | 9 | NaN | 9 | NaN |
| 1203223 | 722817 | 3068 | 2019 | 8 | 6 | 22 | 41 | 38.767 | -104.300 | 1838 | ... | 9 | N | 5 | -17.0 | 6 | NaN | 9 | NaN | 9 | NaN |
| 1203053 | 722817 | 3068 | 2019 | 8 | 1 | 1 | 6 | 38.767 | -104.300 | 1838 | ... | 9 | N | 5 | -17.0 | 5 | NaN | 9 | NaN | 9 | NaN |
| 1105458 | 722518 | 12974 | 2019 | 8 | 12 | 3 | 56 | 27.901 | -98.052 | 78 | ... | 9 | 9 | 9 | -17.0 | 1 | NaN | 9 | 1011.6 | 1 | NaN |
| 1105456 | 722518 | 12974 | 2019 | 8 | 12 | 1 | 56 | 27.901 | -98.052 | 78 | ... | 9 | 9 | 9 | -17.0 | 1 | NaN | 9 | 1010.3 | 1 | NaN |
| 1105454 | 722518 | 12974 | 2019 | 8 | 11 | 23 | 56 | 27.901 | -98.052 | 78 | ... | 9 | 9 | 9 | -17.0 | 1 | NaN | 9 | 1010.3 | 1 | NaN |
| 2370757 | 726764 | 94163 | 2019 | 8 | 27 | 11 | 50 | 44.683 | -111.116 | 2025 | ... | 5 | N | 5 | -3.0 | C | -5.0 | C | NaN | 9 | 86.265368 |
| 2370758 | 726764 | 94163 | 2019 | 8 | 27 | 12 | 10 | 44.683 | -111.116 | 2025 | ... | 5 | N | 5 | -3.0 | 5 | -4.0 | 5 | NaN | 9 | 92.910834 |
| 2370759 | 726764 | 94163 | 2019 | 8 | 27 | 12 | 30 | 44.683 | -111.116 | 2025 | ... | 5 | N | 5 | -3.0 | 5 | -4.0 | 5 | NaN | 9 | 92.910834 |
| 2370760 | 726764 | 94163 | 2019 | 8 | 27 | 12 | 50 | 44.683 | -111.116 | 2025 | ... | 5 | N | 5 | -3.0 | C | -4.0 | C | NaN | 9 | 92.910834 |
| 252488 | 720411 | 137 | 2019 | 8 | 18 | 12 | 35 | 36.422 | -105.290 | 2554 | ... | 5 | N | 5 | -2.4 | 5 | -3.7 | 5 | NaN | 9 | 90.914749 |
| 2370687 | 726764 | 94163 | 2019 | 8 | 26 | 12 | 30 | 44.683 | -111.116 | 2025 | ... | 5 | N | 5 | -2.0 | 5 | -3.0 | 5 | NaN | 9 | 92.966900 |
20 rows × 30 columns
- We should do a quick check of (38.767, -104.300) and (27.901, -98.052) to see if the -17C rows are also implausible and should be removed.
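A quick way to pull those rows for inspection, sketched on a toy stand-in for `met` (the values below are illustrative, not the real records):

```python
import pandas as pd

# Toy stand-in for `met` with just the columns we need (illustrative values).
met_toy = pd.DataFrame({
    "USAFID": [722817, 722518, 690150],
    "lat":    [38.767, 27.901, 34.300],
    "lon":    [-104.300, -98.052, -116.166],
    "temp":   [-17.2, -17.0, 37.2],
})

# Grab the -17C readings at the two suspect coordinates for a closer look.
suspect = met_toy.loc[
    met_toy["temp"].between(-17.2, -17.0)
    & met_toy["lat"].isin([38.767, 27.901]),
    ["USAFID", "lat", "lon", "temp"],
]
print(suspect["USAFID"].tolist())  # → [722817, 722518]
```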
Filter and select
Back to question: what weather stations reported the hottest and coldest daily temperatures?
Compute the min and max hourly temperatures and the associated stations. Use loc together with idxmin() and idxmax(), which return the index labels of the minimum and maximum while ignoring missing values:
min_row = met.loc[met["temp"].idxmin(), ["temp", "USAFID", "day"]]
max_row = met.loc[met["temp"].idxmax(), ["temp", "USAFID", "day"]]
min_row, max_row

(temp -17.2
USAFID 722817
day 1
Name: 1203054, dtype: object,
temp 56.0
USAFID 720267
day 26
Name: 42402, dtype: object)
- Ok but we need daily temperatures to answer our questions!
Grouping and summarizing data
We need to transform our data to answer our initial question at the daily scale. Let’s find daily mean temperature by station and day. Use groupby and agg to calculate the mean and create a new dataset met_daily.
met_daily = (
met.groupby(["USAFID", "day"], as_index=False)
.agg(
temp=("temp", "mean"),
lat=("lat", "mean"),
lon=("lon", "mean"),
rh=("rh", "mean"),
)
.sort_values("temp")
)
met_daily.head(), met_daily.tail()

( USAFID day temp lat lon rh
20774 722817 3 -17.200000 38.767 -104.3 NaN
20773 722817 1 -17.133333 38.767 -104.3 NaN
20775 722817 6 -17.066667 38.767 -104.3 NaN
43605 726130 11 4.278261 44.270 -71.3 99.540440
43625 726130 31 4.304348 44.270 -71.3 99.710254,
USAFID day temp lat lon rh
27043 723805 5 40.975000 34.768000 -114.618000 11.369535
2345 720339 14 41.000000 32.146000 -111.171000 9.137857
27042 723805 4 41.183333 34.768000 -114.618000 14.211822
20623 722787 5 41.357143 33.527000 -112.295000 19.829822
30 690150 31 41.716667 34.299667 -116.165667 8.676717)
Grouping and summarizing data
What day of the month was the hottest?
min_daily = met_daily.loc[met_daily["temp"].idxmin(), ["temp", "USAFID", "day"]]
max_daily = met_daily.loc[met_daily["temp"].idxmax(), ["temp", "USAFID", "day"]]
min_daily, max_daily

(temp -17.2
USAFID 722817.0
day 3.0
Name: 20774, dtype: float64,
temp 41.716667
USAFID 690150.000000
day 31.000000
Name: 30, dtype: float64)
- The hottest station-day was August 31st, when station 690150 recorded an average daily temperature of 41.7°C
Looking at two variables at a time
To answer the last question about covariation between temperature and humidity, we can first calculate correlation
pearson = met["temp"].corr(met["rh"], method="pearson")
spearman = met["temp"].corr(met["rh"], method="spearman")
pearson, spearman, spearman - pearson

(np.float64(-0.5464705142984516),
np.float64(-0.5461260667477443),
np.float64(0.00034444755070728306))
Is it the same for the daily data?
pearson = met_daily["temp"].corr(met_daily["rh"], method="pearson")
spearman = met_daily["temp"].corr(met_daily["rh"], method="spearman")
pearson, spearman, spearman - pearson

(np.float64(-0.25729536769933825),
np.float64(-0.20801597675598085),
np.float64(0.0492793909433574))
The correlation is quite different between hourly and daily data. Why?
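One plausible mechanism, sketched with simulated data (all numbers below are made up): hourly temperature and humidity can share a strong diurnal cycle with opposite signs, and averaging to daily means removes that shared cycle, leaving mostly noise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_days = 620                                # roughly 20 stations x 31 days
hour = np.tile(np.arange(24), n_days)
diurnal = np.sin(2 * np.pi * hour / 24)     # shared daily cycle

# Hourly temp and rh move oppositely with the cycle, plus independent noise.
temp = 25 + 8 * diurnal + rng.normal(0, 1, hour.size)
rh = 60 - 20 * diurnal + rng.normal(0, 5, hour.size)
df = pd.DataFrame({"day": np.repeat(np.arange(n_days), 24), "temp": temp, "rh": rh})

hourly_r = df["temp"].corr(df["rh"])        # dominated by the diurnal cycle
daily = df.groupby("day")[["temp", "rh"]].mean()
daily_r = daily["temp"].corr(daily["rh"])   # the cycle averages out over 24 hours

print(round(hourly_r, 2), round(daily_r, 2))  # strongly negative vs near zero
```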
Exploratory graphs
With exploratory graphs we aim to:
• debug any issues remaining in the data
• understand properties of the data
• look for patterns in the data
• inform modeling strategies
Exploratory graphs do not need to be perfect; we'll look at presentation-ready plots later using plotnine, Python's ggplot equivalent.
Exploratory graphs
What types of data are these exploratory graphs good for?
• histograms
• boxplots
• scatterplots
• barplots
• lineplots
• violin plots
• maps
Exploratory histogram
Focusing on temperature, let’s look at the distribution of hourly temperature (after removing -40°C) using a histogram.
plt.hist(met["temp"].dropna(), bins=50)
plt.xlabel("Temperature (°C)")
plt.ylabel("Count")
plt.title("Hourly Temperature")
plt.show()
Exploratory histogram
Let’s look at the daily data
plt.hist(met_daily["temp"].dropna(), bins=50)
plt.xlabel("Temperature (°C)")
plt.ylabel("Count")
plt.title("Daily Mean Temperature")
plt.show()
Exploratory boxplot
A boxplot gives us an idea of the quantiles of the distribution and any outliers.
plt.boxplot(met["temp"].dropna(), vert=True)
plt.ylabel("Temperature (°C)")
plt.title("Hourly temperature")
plt.show()
Exploratory boxplot
Often boxplots are better for grouped continuous data, such as temperature by hour (over all days in the dataset).
hours = range(24)
data_by_hour = [met.loc[met["hour"] == h, "temp"].to_numpy() for h in hours]
plt.boxplot(data_by_hour, labels=list(hours), showfliers=False)
plt.xlabel("Hour")
plt.ylabel("Temperature (°C)")
plt.title("Temperature by Hour")
plt.show()
Exploratory violin plot
Often violin plots are helpful for grouped continuous data (they show the shape of the distribution), such as temperature by hour (over all days in the dataset).
plt.violinplot(data_by_hour, showmeans=False, showmedians=True, showextrema=False)
plt.xticks(ticks=np.arange(1, 25), labels=list(hours))
plt.xlabel("Hour")
plt.ylabel("Temperature (°C)")
plt.title("Temperature by Hour")
plt.show()
Exploratory scatterplot
Let’s check covariation with a scatterplot of humidity vs temperature.
plt.scatter(met["temp"], met["rh"])
plt.xlabel("Temperature (°C)")
plt.ylabel("Relative humidity (%)")
plt.title("Temperature vs humidity (sample)")
plt.show()
Exploratory scatterplot
Since there are so many points let’s take a random sample of 10,000 rows using sample and add alpha transparency to see the scatter better. random_state is like setting a seed for reproducibility.
met_s = met.sample(min(10_000, len(met)), random_state=1)
plt.scatter(met_s["temp"], met_s["rh"], alpha=0.2)
plt.xlabel("Temperature (°C)")
plt.ylabel("Relative humidity (%)")
plt.title("Temperature vs humidity (sample)")
plt.show()
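An alternative to sampling is to bin the points: matplotlib's hexbin shows density directly and sidesteps overplotting altogether. A sketch on simulated data so it runs standalone (the distribution parameters are invented):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
temp = rng.normal(23.6, 6.1, 100_000)  # invented temperature values
rh = np.clip(100 - 2.0 * (temp - 5) + rng.normal(0, 10, temp.size), 0, 100)

fig, ax = plt.subplots()
hb = ax.hexbin(temp, rh, gridsize=50, mincnt=1)  # count points per hexagonal bin
fig.colorbar(hb, ax=ax, label="Count")
ax.set_xlabel("Temperature (°C)")
ax.set_ylabel("Relative humidity (%)")
ax.set_title("Temperature vs humidity (hexbin, simulated)")
plt.show()
```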
Exploratory map

Exploratory map
You may need to run python3 -m pip install geopandas contextily to make these maps.
import geopandas as gpd
import contextily as cx
# keep only rows with coordinates, then build point geometries from those same rows
pts = met_daily.dropna(subset=["lat", "lon"]).copy()
gdf = gpd.GeoDataFrame(
    pts,
    geometry=gpd.points_from_xy(pts["lon"], pts["lat"]),
    crs="EPSG:4326"
)
gdf_web = gdf.to_crs(epsg=3857)
ax = gdf_web.plot(markersize=10, alpha=0.6)
cx.add_basemap(ax, source=cx.providers.CartoDB.Positron)
ax.set_axis_off()
plt.show()

Interactive map
- Slightly more advanced: make an interactive map with folium.
- You can use leaflet as another (my preferred) option, but it seems to have some problems in quarto.
Interactive map
This is how folium works:
- select unique lat and lon
- set up the map with folium.Map
- add the location points on the map with folium.CircleMarker; set opacity and radius for point size here
Interactive map
import folium
pts = met_daily[["lat", "lon"]].dropna().drop_duplicates()
m = folium.Map(
location=[pts["lat"].mean(), pts["lon"].mean()],
zoom_start=4,
tiles="CartoDB positron"
)
for _, r in pts.iterrows():
folium.CircleMarker(
location=[r["lat"], r["lon"]],
radius=2,
fill=True,
fill_opacity=0.8,
opacity=0.8,
).add_to(m)
m

Exploratory map
Let’s now look at where the hottest and coldest places were based on the daily temperatures. We add a layer for cold and a layer for hot.
cold = met_daily.loc[met_daily["temp"].idxmin()]
hot = met_daily.loc[met_daily["temp"].idxmax()]
cold, hot
center = [met_daily["lat"].mean(), met_daily["lon"].mean()]
m = folium.Map(location=center, zoom_start=4, tiles="CartoDB positron")
# Coldest (blue)
folium.CircleMarker(
location=[cold["lat"], cold["lon"]],
radius=10,
popup=f"Coldest daily mean: {cold['temp']:.2f}°C<br>USAFID={cold['USAFID']}<br>day={cold['day']}",
color="blue",
fill=True,
fill_color="blue",
fill_opacity=0.9,
).add_to(m)
# Hottest (red)
folium.CircleMarker(
location=[hot["lat"], hot["lon"]],
radius=10,
popup=f"Hottest daily mean: {hot['temp']:.2f}°C<br>USAFID={hot['USAFID']}<br>day={hot['day']}",
color="red",
fill=True,
fill_color="red",
fill_opacity=0.9,
).add_to(m)
m

Adding a bit more information to exploratory graphs
- We have seen the concept of layering data in the maps.
- We can add more information to a lot of different types of plots. For example, a scatterplot shows individual observations, but it can be hard to see relationships.
- A common EDA move is to add a regression line to summarize how rh changes with temp.
plotnine (ggplot-style plots in Python)
If you like ggplot2, plotnine uses the same “add layers with +” workflow:
- map variables with aes()
- add geoms (points, smoothers)
- add labels/themes
import numpy as np
from sklearn.metrics import r2_score
from plotnine import (
ggplot, aes, geom_point, geom_smooth, labs, theme_bw, annotate, coord_cartesian
)
met_s = met[["temp", "rh"]].dropna().sample(min(10_000, len(met)), random_state=1)
# Fit line: rh = m*temp + b
x = met_s["temp"].to_numpy()
y = met_s["rh"].to_numpy()
m, b = np.polyfit(x, y, 1)
y_hat = m * x + b
r2 = r2_score(y, y_hat)
label = f"rh = {m:.2f}·temp + {b:.2f}\nR² = {r2:.3f}"
# location for the label
x_pos = np.quantile(x, 0.9)
y_pos = np.quantile(y, 0.999)
(
ggplot(met_s, aes(x="temp", y="rh"))
+ geom_point(alpha=0.2, size=1.0)
+ geom_smooth(method="lm", se=False)
+ annotate("text", x=x_pos, y=y_pos, label=label, ha="left")
+ coord_cartesian(ylim=(0, 100))
+ labs(
title="Temperature vs humidity (sample) with linear fit",
x="Temperature (°C)",
y="Relative humidity (%)"
)
+ theme_bw()
)
plotnine
(
ggplot(met_s, aes(x="temp", y="rh"))
+ geom_point(alpha=0.2, size=1.0)
+ geom_smooth(method="loess", se=False)
+ coord_cartesian(ylim=(0, 100))
+ labs(
title="Temperature vs humidity (sample) with loess smoother",
x="Temperature (°C)",
y="Relative humidity (%)"
)
+ theme_bw()
)